Bootstrapping Parallel Corpora

نویسندگان

  • Chris Callison-Burch
  • Miles Osborne
چکیده

We present two methods for the automatic creation of parallel corpora. Whereas previous work into the automatic construction of parallel corpora has focused on harvesting them from the web, we examine the use of existing parallel corpora to bootstrap data for new language pairs. First, we extend existing parallel corpora using co-training, wherein machine translations are selectively added to training corpora with multiple source texts. Retraining translation models yields modest improvements. Second, we simulate the creation of training data for a language pair for which a parallel corpus is not available. Starting with no human translations from German to English we produce a German to English translation model with 45% accuracy using parallel corpora in other languages. This suggests the method may be useful in the creation of parallel corpora for languages with scarce resources.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Bootstrapping Translation Detection and Sentence Extraction from Comparable Corpora

Most work on extracting parallel text from comparable corpora depends on linguistic resources such as seed parallel documents or translation dictionaries. This paper presents a simple baseline approach for bootstrapping a parallel collection. It starts by observing documents published on similar dates and the cooccurrence of a small number of identical tokens across languages. It then uses fast...

متن کامل

Improving Machine Translation Performance Using Comparable Corpora

The overwhelming majority of the languages in the world are spoken by less than 50 million native speakers, and automatic translation of many of these languages is less investigated due to the lack of linguistic resources such as parallel corpora. In the ACCURAT project we will work on novel methods how comparable corpora can compensate for this shortage and improve machine translation systems ...

متن کامل

Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and EM

We present a method capable of extracting parallel sentences from far more disparate “very-non-parallel corpora” than previous “comparable corpora” methods, by exploiting bootstrapping on top of IBM Model 4 EM. Step 1 of our method, like previous methods, uses similarity measures to find matching documents in a corpus first, and then extracts parallel sentences as well as new word translations ...

متن کامل

Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and E

We present a method capable of extracting parallel sentences from far more disparate “very-non-parallel corpora” than previous “comparable corpora” methods, by exploiting bootstrapping on top of IBM Model 4 EM. Step 1 of our method, like previous methods, uses similarity measures to find matching documents in a corpus first, and then extracts parallel sentences as well as new word translations ...

متن کامل

استخراج پیکره‌ موازی از اسناد قابل‌مقایسه برای بهبود کیفیت ترجمه در سیستم‌های ترجمه ماشینی

Data used for training statistical machine translation method are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation but due to lack of these kinds of texts, non-parallel and comparable corpora are used either for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003